Abstract: This paper studies the sample complexity of searching
over multiple populations. We consider a large number of
populations, each corresponding to either distribution $P_0$ or $P_1$.
The goal of the search problem studied here is to find one
population corresponding to distribution $P_1$ with as few samples
as possible. The main contribution is to precisely quantify the
number of samples needed to correctly find one such population.
We consider two general approaches: non-adaptive sampling
methods, which sample each population a predetermined number
of times until a population following $P_1$ is found, and adaptive
sampling methods, which employ sequential sampling schemes
for each population. We first derive a lower bound on the
number of samples required by any sampling scheme. We then
consider an adaptive procedure consisting of a series of sequential
probability ratio tests, and show it comes within a small constant
factor of the lower bound. We give explicit expressions for
this constant when samples of the populations follow Gaussian
and Bernoulli distributions. An alternative adaptive scheme is
discussed which does not require full knowledge of $P_1$, and
outperforms non-adaptive schemes. For comparison, a lower
bound on the sampling requirements of any non-adaptive scheme
is presented.

Index Terms: Quickest search, rare events, SPRT, CUSUM
procedure, sparse recovery, sequential analysis, sequential
thresholding, biased coin, spectrum sensing, multi-armed bandit.

I. Introduction 

This paper studies the sample complexity of finding a
population corresponding to some distribution $P_1$ among a
large number of populations corresponding to either distribution
$P_0$ or $P_1$. More specifically, let $i = 1, 2, \ldots$ index the
populations. Samples of each population follow one of two
distributions, indicated by a binary label $X_i$: if $X_i = 0$, then
samples of population $i$ follow distribution $P_0$; if $X_i = 1$, then
samples follow distribution $P_1$. We assume that $X_1, X_2, \ldots$
are independently and identically distributed (i.i.d.) Bernoulli
random variables with $\mathbb{P}(X_i = 0) = 1 - \pi$ and
$\mathbb{P}(X_i = 1) = \pi$.
Distribution $P_1$ is termed the atypical distribution, which
corresponds to atypical populations, and the probability $\pi$
quantifies the occurrence of such populations. The goal of the
search problem studied here is to find an atypical population
with as few samples as possible.

In this search problem, populations are sampled a (deterministic
or random) number of times, in sequence, until an
atypical population is found. The total number of samples
needed is a function of the sampling strategy, the distributions,
the required reliability, and $\pi$. To build intuition, consider
the following. As the occurrence of the atypical populations
becomes infrequent (i.e., as $\pi \to 0$), the number of samples
required to find one such population must, of course, increase.
If $P_0$ and $P_1$ are extremely different (e.g., non-overlapping
supports), then a search procedure could simply proceed by
taking one sample of each population until an atypical
population was found. The procedure would identify an atypical
population with, on average, $\pi^{-1}$ samples. More generally,
when the two distributions are more difficult to distinguish, as
is the concern of this paper, we must take multiple samples
of some populations. As the required reliability of the search
increases, a procedure must also take more samples to confirm,
with increasing certainty, that an atypical population has been
found.

The main contribution of this work is to precisely quantify
the number of samples needed to correctly find one atypical
population. Specifically, we provide tight bounds on the
expected number of samples required to find a population
corresponding to $P_1$ with a specified level of certainty. We
pay additional attention to this sample complexity as $\pi$
becomes small (and the occurrence of the atypical populations
becomes rare). We consider two general approaches to find
an atypical population, both of which sample populations in
sequence. Non-adaptive procedures sample each population
a predetermined number of times, make a decision, and if
the null hypothesis is accepted then move on to the next
population. Adaptive methods, in contrast, enjoy the flexibility
to sample each population sequentially, and thus, the decision
to continue sampling a particular population can be based on
prior samples.

The developments in this paper proceed as follows. First,
using techniques from sequential analysis, we derive a lower
bound on the expected number of samples needed to reliably
identify an atypical population. To preview the results, the
lower bound implies that any procedure (adaptive or
non-adaptive) is unreliable if it uses fewer than
$\pi^{-1} D(P_0\|P_1)^{-1}$ samples on average, where $D(P_0\|P_1)$
is the Kullback-Leibler divergence. We then prove this is tight
by showing that a series of sequential probability ratio tests
(which we abbreviate as an S-SPRT) succeeds with high probability
if the total number of samples is within a constant factor of the
lower bound, provided a minor constraint on the log-likelihood
statistic is satisfied (which holds for bounded distributions,
Gaussian, exponential, among others). We give explicit expressions
for this constant in the Gaussian and Bernoulli cases. In the
Bernoulli case, the bound derived by instantiating our general
results produces the tightest known bound. In many real
world problems, insufficient knowledge of the distributions
$P_0$ and $P_1$ makes implementing an S-SPRT impractical. To
address this shortcoming, we propose a more practical adaptive
procedure known as sequential thresholding, which doesn't
require precise knowledge of $P_1$, and is particularly well suited
for problems in which occurrence of an atypical population is
rare. We show sequential thresholding is near-optimal when
$\pi \to 0$. Both the S-SPRT procedure and sequential
thresholding are shown to be robust to imperfect knowledge of $\pi$.
Lastly, we show that non-adaptive procedures require at least
$\pi^{-1} D(P_1\|P_0)^{-1} \log \pi^{-1}$ samples to reliably find an
atypical population, a factor of $\log \pi^{-1}$ more samples when
compared to adaptive methods.



A. Motivating Applications 

Finding an atypical population arises in many relevant prob- 
lems in science and engineering. One of the main motivations 
for our work is the problem of spectrum sensing in cognitive 
radio. In cognitive radio applications, one is interested in 
finding a vacant radio channel among a potentially large 
number of occupied channels. Only once a vacant channel 
is identified can the cognitive device transmit, and thus, 
identifying a vacant channel as quickly as possible is of great 
interest. A number of works have looked at various adaptive
methods for spectrum sensing in similar contexts.

Another captivating example is the Search for Extraterrestrial
Intelligence (SETI) project. Researchers at the SETI
Institute use large antenna arrays to sense for narrowband
electromagnetic energy from distant star systems, with the hopes
of finding extraterrestrial intelligence with technology similar
to ours. The search space consists of a virtually unlimited
number of stars, over 100 billion in the Milky Way alone, each
with 9 million potential "frequencies" in which to sense for
narrowband energy. The prior probability of extraterrestrial
transmission is indeed very small (SETI has yet to make
a "contact"), and thus occurrence of atypical populations is
rare. Roughly speaking, SETI employs a variable sample size
search procedure that repeatedly tests energy levels against a
threshold up to five times [5], [6]. If any of the measurements
are below the threshold, the procedure immediately passes to
the next frequency/star. This procedure is closely related to
sequential thresholding [7]. Sequential thresholding results in
substantial gains over fixed sample size procedures and, unlike
the SPRT, it can be implemented without perfect knowledge
of $P_1$.



B. Related Work 

The prior work most closely related to the problem
investigated here is that by Lai, Poor, Xin, and Georgiadis [8], in
which the authors also examine the problem of quickest search
across multiple populations, but do not focus on quantifying
the sample complexity. The authors show that the S-SPRT
(also termed a CUSUM test) optimizes a linear combination
of the expected number of samples and the error probability.
Complementary to this, our contributions include providing
tight lower bounds on the expected number of samples
required to achieve a desired probability of error, and then
showing the sample complexity of the S-SPRT comes within a
small constant of this bound. This quantifies how the number
of samples required to find an atypical population depends on
the distributions $P_0$ and $P_1$ and the probability $\pi$, which was
not explicitly investigated in [8]. As a by-product, this proves
the optimality of the S-SPRT.

An instance of the quickest search problem was also studied
recently in [9], where the authors investigate the problem
of finding a biased coin with the fewest flips. Our more
general results are derived using different techniques, and
cover this case with $P_0$ and $P_1$ as Bernoulli distributions. In
[9], the authors present a bound on the expected number of
flips needed to find a biased coin. The bound derived from
instantiating our more general theory (see Example 2 and
Corollary 8) is a minimum of 32 times tighter than the bound
in [9].

Also closely related is the problem of sparse signal support
recovery from point-wise observations [7], [10], [11], classical
work in optimal scanning theory [12], [13], and work on pure
exploration in multi-armed bandit problems [14], [15]. The
sparse signal recovery problems differ in that the total number
of populations is finite, and the objective is to recover all
(or most) populations following $P_1$, as opposed to finding a
single population and terminating the procedure. Traditional
multi-armed bandit problems differ in that no knowledge of
the distributions of the arms is assumed.

II. Problem Setup 

Consider an infinite number of populations indexed by
$i = 1, 2, \ldots$. For each population $i$, samples of that population
are distributed either

$$Y_{i,j} \sim P_0 \quad \text{if } X_i = 0, \quad \text{or}$$
$$Y_{i,j} \sim P_1 \quad \text{if } X_i = 1,$$

where $P_0$ and $P_1$ are probability measures supported on
$\mathcal{Y}$, $j$ indexes multiple i.i.d. samples of a particular
population, and $X_i$ is a binary label. The goal is to find a
population $i$ such that $X_i = 1$ as quickly and reliably as
possible. The labels $X_i$, which determine whether population $i$
follows $P_1$ or $P_0$, are i.i.d. with prior probabilities

$$\mathbb{P}(X_i = 1) = \pi$$
$$\mathbb{P}(X_i = 0) = 1 - \pi$$

where we assume $\pi \le 1/2$ without loss of generality.



Algorithm 1 Search for an atypical population
initialize: $i = 1$, $j = 1$
while atypical population not found do
    sample $Y_{i,j}$
    either
        1) re-sample population $i$: $j = j + 1$
        2) move to next population: $i = i + 1$, $j = 1$
        3) terminate: $\hat{X}_i = 1$
end while
output: $I = i$



Also without loss of generality, a testing procedure starts
at population $i = 1$ and takes one sample. The procedure
then decides to either 1) take an additional sample of $i = 1$,
or 2) estimate population $i = 1$ as following distribution $P_0$
(deciding $\hat{X}_1 = 0$) and move to index 2, or 3) terminate,
declaring population $i = 1$ as following distribution $P_1$
(deciding $\hat{X}_1 = 1$). Provided the procedure doesn't terminate,
it continues in this fashion, taking one of three actions after
each sample is taken. As in [8], the procedure does not revisit
populations (which is well justified as each population is
independent of all others).

The performance of any testing procedure is characterized
by two metrics: 1) the expected number of samples required
for the procedure to terminate, denoted $\mathbb{E}[N]$, and 2) the
probability the procedure returns an index not corresponding
to a population following $P_1$. We denote this probability as

$$P_e := \mathbb{P}\left(I \in \{i : X_i = 0\}\right)$$

where $I$ is a random variable representing the index on which
the procedure terminates.

Imagine that the procedure is currently sampling index $i$.
For a given sampling procedure, if $X_i = 1$, the probability the
procedure passes to index $i + 1$ without terminating is denoted
$\beta$, and the probability the procedure correctly declares
$\hat{X}_i = 1$ is $1 - \beta$. Likewise, for any $i$ such that
$X_i = 0$, the procedure falsely declares $\hat{X}_i = 1$ with
probability $\alpha$, and continues to index $i + 1$ with probability
$1 - \alpha$. In other words, provided the procedure arrives at
population $i$,

$$\beta = \mathbb{P}(\hat{X}_i = 0 \mid X_i = 1)$$
$$\alpha = \mathbb{P}(\hat{X}_i = 1 \mid X_i = 0).$$

In essence, the procedure consists of a number of simple
binary hypothesis tests, each with false positive probability
$\alpha$ and false negative probability $\beta$.

The following recursive relationships will be central to our
performance analysis. Let $N_i$ be the (random) number of
samples taken of population $i$, and $N = \sum_{i=1}^{\infty} N_i$
be the total number of samples taken by the procedure. We can
write the expected number of samples as

$$\mathbb{E}[N] = \mathbb{E}[N_1] + \mathbb{E}[N_2 + N_3 + \cdots \mid \hat{X}_1 = 0]\,\big((1-\pi)(1-\alpha) + \pi\beta\big) \tag{1}$$

where $(1-\pi)(1-\alpha) + \pi\beta$ is the probability the procedure
arrives at the second index. The expected number of samples
used from the second index onwards, given that the procedure
arrives at the second index (without declaring $I = 1$), is simply
equal to the expected total number of samples:
$\mathbb{E}[N_2 + N_3 + \cdots \mid \hat{X}_1 = 0] = \mathbb{E}[N]$.
Rearranging terms in (1) gives the following relationship:

$$\mathbb{E}[N] = \frac{\mathbb{E}[N_1]}{\alpha(1-\pi) + \pi(1-\beta)}. \tag{2}$$



For simplicity of notation, denote the expected number of
measurements conditioned on the binary label as

$$E_1 = \mathbb{E}[N_1 \mid X_1 = 1], \qquad E_0 = \mathbb{E}[N_1 \mid X_1 = 0]$$

and thus,

$$\mathbb{E}[N] = \frac{\pi E_1 + (1-\pi) E_0}{\alpha(1-\pi) + \pi(1-\beta)}. \tag{3}$$



In the same manner we arrive at the following expression for
the probability of error:

$$P_e = \frac{\alpha(1-\pi)}{\alpha(1-\pi) + \pi(1-\beta)} = \frac{1}{1 + \frac{\pi(1-\beta)}{\alpha(1-\pi)}}. \tag{4}$$

From this expression we see that if

$$\frac{\alpha(1-\pi)}{\pi(1-\beta)} \ge \delta$$

for some $\delta > 0$, then $P_e \ge \frac{\delta}{1+\delta}$, and $P_e$ is
greater than or equal to some positive constant.
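
As a quick numerical companion to (3) and (4), the sketch below evaluates both expressions for given per-population test characteristics. The function names and the example parameter values are illustrative assumptions, not part of the original analysis.

```python
def expected_samples(pi, alpha, beta, e0, e1):
    """Equation (3): expected total number of samples E[N].

    e0 and e1 are the expected samples spent on one population
    with label 0 and label 1, respectively.
    """
    stop_prob = alpha * (1 - pi) + pi * (1 - beta)  # probability of terminating on a given index
    return (pi * e1 + (1 - pi) * e0) / stop_prob


def error_probability(pi, alpha, beta):
    """Equation (4): probability that the returned index follows P_0."""
    false_stop = alpha * (1 - pi)
    return false_stop / (false_stop + pi * (1 - beta))


# Hypothetical operating point: rare atypical populations, small false-positive rate.
pi, alpha, beta = 0.01, 1e-4, 0.1
print(expected_samples(pi, alpha, beta, e0=2.0, e1=20.0))
print(error_probability(pi, alpha, beta))
```

In the noise-free case (`alpha = beta = 0`, one sample per population) the first function reduces to $1/\pi$, matching the intuition given in the introduction.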

Lastly, the bounds derived throughout often depend on
explicit constants, in particular the Kullback-Leibler
divergence between the distributions, which is defined in the usual
manner:

$$D(P_0\|P_1) = \mathbb{E}_0\left[\log\frac{P_0(Y)}{P_1(Y)}\right],$$

where $\mathbb{E}_0$ denotes expectation with respect to $P_0$.
Other constants are denoted by $C_1$, $C_1'$, etc., and represent
distinct numerical constants.
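
For distributions on a finite alphabet the divergence can be computed directly from its definition. The following minimal sketch (the helper name `kl_divergence` is an assumption for illustration) uses natural logarithms and checks the result against the closed form $2b\log\frac{1+2b}{1-2b}$ that arises for the biased-coin setting of Example 2.

```python
import math


def kl_divergence(p0, p1):
    """D(p0 || p1) for two discrete distributions on the same finite support.

    Terms with p0-mass zero contribute nothing; p1 is assumed to be
    strictly positive wherever p0 is.
    """
    return sum(a * math.log(a / b) for a, b in zip(p0, p1) if a > 0)


b = 0.1
p0 = [0.5 - b, 0.5 + b]  # [P(heads), P(tails)] for a typical coin
p1 = [0.5 + b, 0.5 - b]  # [P(heads), P(tails)] for an atypical coin

closed_form = 2 * b * math.log((1 + 2 * b) / (1 - 2 * b))
print(kl_divergence(p0, p1), closed_form)
```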

III. Lower bound for any procedure 

We begin with a lower bound on the number of samples
required by any procedure to find a population following
distribution $P_1$. Before stating the main theorem of the section,
we present a number of corollaries of Theorem 5 aimed
at highlighting the explicit relationship between the problem
parameters.

Corollary 1. Any procedure with

$$P_e \le \frac{\delta}{1+\delta}$$

also has

$$\mathbb{E}[N] \ge \frac{1}{D(P_0\|P_1)}\left(\frac{1}{12\pi} + \frac{1}{3}\log\frac{1}{\delta} - 1\right) \tag{5}$$

for any $\delta \le 1/2$. Here, we assume $D(P_0\|P_1) = D(P_1\|P_0)$
for simplicity of presentation.

Proof of Corollary 1 follows immediately from Theorem 5 as
$\pi \le 1/2$ and $\delta \le 1/2$.

Corollary 1 provides a particularly intuitive way to quantify
the number of samples required for the quickest search problem.
The first term in (5), which has a $1/\pi$ dependence, can
be interpreted as the minimum number of samples required
to find a population following distribution $P_1$. The second
term, which has a $\log \delta^{-1}$ dependence, is best interpreted as
the minimum number of samples required to confirm that a
population following $P_1$ has been found.

When the populations following distribution $P_1$ become rare
(when $\pi$ tends to zero), the second and third terms in (5)
become small compared to the first term. This suggests the
vast majority of samples are used to find a rare population,
and a vanishing proportion are needed for confirmation. The
corollaries below capture this effect. The leading constants are
of particular importance, as we relate them to upper bounds in
Sec. IV. In the following, consider $P_e$ and $\mathbb{E}[N]$ as functions
of $\pi$, $P_0$, $P_1$, and some sampling procedure $\mathcal{A}$.

Corollary 2. Rare population. Fix $\delta \in (0, 1/2]$. Then any
procedure $\mathcal{A}$ that satisfies

$$\limsup_{\pi \to 0} P_e \le \frac{\delta}{1+\delta}$$

also has

$$\liminf_{\pi \to 0} \pi\,\mathbb{E}[N] \ge \frac{(1-\delta)^2}{1+\delta}\max\left(1, \frac{1}{D(P_0\|P_1)}\right).$$

The proof of Corollary 2 follows from Theorem 5 by noting
both the second and third terms of (6) are overwhelmed as
$\pi$ becomes small. The lower bound in Corollary 2 states
that any procedure requiring fewer than order $1/\pi$ samples is
unreliable, and is best interpreted in two regimes: (1) the high
SNR regime, when $D(P_0\|P_1) > 1$, and (2) the low SNR
regime, when $D(P_0\|P_1) \le 1$.

Corollary 3. High SNR. When $D(P_0\|P_1) > 1$, any procedure
with $\lim_{\pi \to 0} P_e = 0$ also has
$\lim_{\pi \to 0} \pi\,\mathbb{E}[N] \ge 1$.

The proof follows from Corollary 2. Corollary 3 states that
any procedure requiring fewer samples in expectation than
$\pi^{-1}$ also has probability of error bounded away from zero.
The bound becomes tight when the SNR becomes high: when
$D(P_0\|P_1)$ is sufficiently large, we expect to classify each
population with one sample.

Corollary 4. Low SNR. If $D(P_0\|P_1) \le 1$, any procedure
with $\lim_{\pi \to 0} P_e = 0$ also has
$\lim_{\pi \to 0} \pi\,\mathbb{E}[N] \ge 1/D(P_0\|P_1)$.

Again the proof follows from Corollary 2. The corollary
simply indicates the following: in the low SNR regime the
sampling requirements are at best an additional factor of
$D(P_0\|P_1)^{-1}$ higher than when we can classify each
distribution with one sample.

Next we state a general lower bound in the main theorem
of the section.

Theorem 5. Any procedure with

$$P_e \le \frac{\delta}{1+\delta}$$

also has

$$\mathbb{E}[N] \ge \frac{1-\pi}{\pi}\,\frac{(1-\delta)^2}{1+\delta}\max\left(1, \frac{1}{D(P_0\|P_1)}\right) + \frac{\log\left(\frac{1}{2\delta\pi}\right)}{D(P_1\|P_0)}\cdot\frac{1 - \delta\frac{D(P_1\|P_0)}{D(P_0\|P_1)}}{1+\delta} - \frac{1}{D(P_1\|P_0)} \tag{6}$$

for any $\delta \in [0, 1/2]$.

Proof: Assume that $P_e \le \frac{\delta}{1+\delta}$, and from (4) we have

$$\frac{\alpha(1-\pi)}{\pi(1-\beta)} \le \delta. \tag{7}$$

From (3),

$$\mathbb{E}[N] = \frac{\pi E_1 + (1-\pi)E_0}{\alpha(1-\pi) + \pi(1-\beta)} \ge \frac{\pi E_1 + (1-\pi)E_0}{(1+\delta)\pi(1-\beta)} = \frac{E_1}{(1+\delta)(1-\beta)} + \frac{(1-\pi)E_0}{\pi(1+\delta)(1-\beta)}. \tag{8}$$

From standard sequential analysis techniques (see Theorem
2.29 of [16]) we have the following inequalities relating the
expected number of measurements to $\alpha$ and $\beta$, which hold
for any binary hypothesis testing procedure:

$$E_1 \ge \frac{\beta\log\left(\frac{\beta}{1-\alpha}\right) + (1-\beta)\log\left(\frac{1-\beta}{\alpha}\right)}{D(P_1\|P_0)} \tag{9}$$

$$E_0 \ge \frac{\alpha\log\left(\frac{\alpha}{1-\beta}\right) + (1-\alpha)\log\left(\frac{1-\alpha}{\beta}\right)}{D(P_0\|P_1)}. \tag{10}$$

Rearranging (8),

$$\mathbb{E}[N] \ge \underbrace{\frac{\beta\log\left(\frac{\beta}{1-\alpha}\right)}{(1+\delta)(1-\beta)D(P_1\|P_0)}}_{T_1} + \underbrace{\frac{\log\left(\frac{1-\beta}{\alpha}\right)}{(1+\delta)D(P_1\|P_0)}}_{T_2} + \underbrace{\frac{(1-\pi)\left(\alpha\log\left(\frac{\alpha}{1-\beta}\right) + (1-\alpha)\log\left(\frac{1-\alpha}{\beta}\right)\right)}{\pi(1+\delta)(1-\beta)D(P_0\|P_1)}}_{T_3}.$$

We first bound $T_1$ as

$$T_1 \ge \frac{-1}{(1+\delta)D(P_1\|P_0)} \ge \frac{-1}{D(P_1\|P_0)} \tag{11}$$

since for all $\beta \in [0, 1]$,

$$\frac{\beta\log\frac{\beta}{1-\alpha}}{1-\beta} \ge \frac{\beta\log\beta}{1-\beta} \ge -1.$$

From (7),

$$T_2 \ge \frac{\log\left(\frac{1-\pi}{\delta\pi}\right)}{(1+\delta)D(P_1\|P_0)}.$$



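Next, differentiating $T_3$ with respect to $\alpha$ gives

$$\frac{\partial T_3}{\partial \alpha} = \frac{(1-\pi)\left(\log\frac{\alpha}{1-\beta} - \log\frac{1-\alpha}{\beta}\right)}{(1+\delta)\pi(1-\beta)D(P_0\|P_1)}$$

showing that the expression is non-increasing in $\alpha$ over the
set of $\alpha$ satisfying $\frac{\alpha}{1-\beta} \le \frac{1-\alpha}{\beta}$.
From (7), we are restricted to
$\frac{\alpha}{1-\beta} \le \frac{\delta\pi}{1-\pi}$, and thus, if
$\frac{\delta\pi}{1-\pi} \le \frac{1-\alpha}{\beta}$, then (10) is
non-increasing in $\alpha$. To show this, note that

$$\frac{\delta\pi}{1-\pi} \le \delta \le 1-\delta \le 1 - \frac{\alpha(1-\pi)}{\pi(1-\beta)} \le 1-\alpha \le \frac{1-\alpha}{\beta}$$

since both $\delta \le 1/2$ and $\pi \le 1/2$. We can replace $\alpha$
in (10) with $\frac{\delta\pi(1-\beta)}{1-\pi}$. This gives

$$T_3 \ge \frac{\delta\log\left(\frac{\delta\pi}{1-\pi}\right)}{(1+\delta)D(P_0\|P_1)} + \frac{(1-\pi)\left(1-\frac{\delta\pi(1-\beta)}{1-\pi}\right)\log\left(\frac{1-\frac{\delta\pi(1-\beta)}{1-\pi}}{\beta}\right)}{\pi D(P_0\|P_1)(1+\delta)(1-\beta)}$$

$$\ge \frac{\delta\log\left(\frac{\delta\pi}{1-\pi}\right)}{(1+\delta)D(P_0\|P_1)} + \frac{(1-\pi)(1-\delta)\log\left(\frac{1-\delta(1-\beta)}{\beta}\right)}{\pi D(P_0\|P_1)(1+\delta)(1-\beta)}$$

$$\ge \frac{\delta\log\left(\frac{\delta\pi}{1-\pi}\right)}{(1+\delta)D(P_0\|P_1)} + \frac{(1-\pi)(1-\delta)^2}{\pi D(P_0\|P_1)(1+\delta)}$$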
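where the first inequality follows from making the substitution
for $\alpha$ and from (7), the second inequality follows since
$\pi/(1-\pi) \le 1$ and $1-\beta \le 1$, and the last inequality follows
as

$$\frac{\log\left(\frac{1-\delta(1-\beta)}{\beta}\right)}{1-\beta} \ge 1-\delta \tag{12}$$

for all $\beta \in [0, 1]$. To see the validity of (12), we note

$$\frac{\log\left(\frac{1-\delta(1-\beta)}{\beta}\right)}{(1-\beta)(1-\delta)} = \frac{\log\left(1+\frac{(1-\beta)(1-\delta)}{\beta}\right)}{(1-\beta)(1-\delta)} \overset{(*)}{\ge} \frac{\log\left(1+\frac{1-\beta}{\beta}\right)}{1-\beta} = \frac{\log(1/\beta)}{1-\beta} \ge 1.$$

Here $(*)$ follows by noting that $\log(1 + x/\beta)/x$ is
monotonically decreasing in $x$, and by setting $x = 1-\beta$.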

We can also trivially bound $E_0$ by noting that $E_0 \ge 1$. This
provides an additional bound on $T_3$:

$$T_3 \ge \frac{1-\pi}{\pi(1+\delta)(1-\beta)} \ge \frac{\delta\log\left(\frac{\delta\pi}{1-\pi}\right)}{(1+\delta)D(P_0\|P_1)} + \frac{(1-\pi)(1-\delta)^2}{\pi(1+\delta)} \tag{13}$$

since the first term in (13) is strictly negative.

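Combining the bounds on $T_1$ and $T_2$, and the two bounds
on $T_3$, and noting that $\delta\pi/(1-\pi) \le 2\delta\pi$ gives

$$\mathbb{E}[N] \ge \frac{1-\pi}{\pi}\,\frac{(1-\delta)^2}{1+\delta}\max\left(1, \frac{1}{D(P_0\|P_1)}\right) + \frac{\log\left(\frac{1}{2\delta\pi}\right)}{D(P_1\|P_0)}\cdot\frac{1 - \delta\frac{D(P_1\|P_0)}{D(P_0\|P_1)}}{1+\delta} - \frac{1}{D(P_1\|P_0)}$$

completing the proof.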
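
To get a feel for the scale of the lower bound in the rare-population regime, the sketch below evaluates the leading term that survives as $\pi \to 0$, namely $\frac{(1-\delta)^2}{1+\delta}\max(1, 1/D(P_0\|P_1))/\pi$ from Corollary 2. It is only an illustration: the function name is hypothetical, and the lower-order terms of the theorem are deliberately ignored.

```python
def rare_population_lower_bound(pi, delta, d01):
    """Leading-order lower bound on E[N] for small pi (Corollary 2).

    pi    : prior probability of an atypical population
    delta : reliability parameter, P_e <= delta / (1 + delta)
    d01   : Kullback-Leibler divergence D(P_0 || P_1)
    """
    snr_factor = max(1.0, 1.0 / d01)  # one sample suffices only when d01 > 1
    return (1 - delta) ** 2 / (1 + delta) * snr_factor / pi


# High-SNR versus low-SNR regime at the same prior and reliability.
print(rare_population_lower_bound(1e-3, 1e-2, d01=2.0))
print(rare_population_lower_bound(1e-3, 1e-2, d01=0.5))
```

Halving the divergence below one doubles the bound, which is the low-SNR penalty described after Corollary 4.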



IV. S-SPRT Procedure 

The Sequential Probability Ratio Test (SPRT), optimal for
simple binary hypothesis tests in terms of minimizing the
expected number of samples for tests of a given power [17],
can be applied to the problem studied here by implementing a
series of SPRTs on the individual populations. For notational
convenience, we refer to this procedure as the S-SPRT. This
is equivalent in form to the CUSUM test studied in [8], which
is traditionally applied to change point detection problems.

The S-SPRT operates as follows. Imagine the procedure
has currently taken $j$ samples of population $i$. The procedure
continues to sample population $i$ provided

$$\gamma_L < \ell_{i,j} < \gamma_U \tag{14}$$

where $\ell_{i,j} := \prod_{k=1}^{j} \frac{P_1(Y_{i,k})}{P_0(Y_{i,k})}$ is the
likelihood ratio statistic, and $\gamma_U$ and $\gamma_L$ are scalar upper
and lower thresholds. In words, the procedure continues to sample
population $i$ provided the likelihood ratio comprised of samples
of that population is between two scalar thresholds. The S-SPRT
stops sampling population $i$ after $N_i$ samples, which is a random
integer representing the smallest number of samples such that (14)
no longer holds:

$$N_i := \min\{j : \ell_{i,j} \le \gamma_L \ \text{or} \ \ell_{i,j} \ge \gamma_U\}.$$

When the likelihood ratio exceeds (or equals) $\gamma_U$, then
$\hat{X}_i = 1$, and the S-SPRT terminates returning $I = i$.
Conversely, if the likelihood ratio falls below (or equals)
$\gamma_L$, then $\hat{X}_i = 0$, and the procedure moves to index
$i + 1$. The procedure is detailed in Algorithm 2.

Algorithm 2 Series of SPRTs Procedure (S-SPRT)
input: thresholds $\gamma_L$, $\gamma_U$, distributions $P_0$, $P_1$
initialize: $i = 1$, $j = 1$, $\ell = 1$
while $\ell < \gamma_U$ do
    measure: $Y_{i,j}$
    compute: $\ell = \ell \cdot \frac{P_1(Y_{i,j})}{P_0(Y_{i,j})}$
    if $\ell \le \gamma_L$ then
        $i = i + 1$, $j = 1$, $\ell = 1$
    else
        $j = j + 1$
    end if
end while
output: $I = i$
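
Algorithm 2 can be simulated directly. The following is a minimal sketch for the Gaussian setting $P_0 = \mathcal{N}(-\mu, 1)$, $P_1 = \mathcal{N}(\mu, 1)$, where the per-sample log-likelihood ratio is $2\mu y$; it works in the log domain for numerical stability and draws the hidden labels itself so that errors can be counted. The function name, the choice of distributions, and the parameter values are assumptions made for illustration only.

```python
import math
import random


def s_sprt_gaussian(pi, mu, gamma_l, gamma_u, rng):
    """Run one S-SPRT search over Gaussian populations.

    Returns (index I, true label X_I, total number of samples N).
    Requires 0 < gamma_l < 1 < gamma_u so that every population is
    sampled at least once.
    """
    if not 0 < gamma_l < 1 < gamma_u:
        raise ValueError("need 0 < gamma_l < 1 < gamma_u")
    log_l, log_u = math.log(gamma_l), math.log(gamma_u)
    i, total = 1, 0
    while True:
        label = 1 if rng.random() < pi else 0   # hidden label X_i
        mean = mu if label == 1 else -mu
        llr = 0.0                               # log of the likelihood ratio
        while log_l < llr < log_u:
            y = rng.gauss(mean, 1.0)
            llr += 2.0 * mu * y                 # log P_1(y) / P_0(y)
            total += 1
        if llr >= log_u:                        # declare population i atypical
            return i, label, total
        i += 1                                  # otherwise move to the next population


rng = random.Random(0)
pi, mu, delta = 0.05, 1.0, 0.05
gamma_u = (1 - pi) / (pi * delta)               # upper threshold of the form used later in Theorem 7
index, label, n = s_sprt_gaussian(pi, mu, 0.5, gamma_u, rng)
print(index, label, n)
```

Averaging `n` and the fraction of runs with `label == 0` over many calls gives Monte Carlo estimates of $\mathbb{E}[N]$ and $P_e$.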



The S-SPRT procedure studied in [8] fixes the lower threshold
in each individual SPRT at $\gamma_L = 1$, which has a very
intuitive interpretation. Since there are an infinite number
of populations, anytime a sample suggests that a particular
population doesn't follow $P_1$, moving to another population
is best. While this approach is optimal [8], we use a strictly
smaller threshold, as it results in a simpler derivation of the
upper bound.

In the following theorem and corollary we assume a minor
restriction on the tail distribution of the log-likelihood ratio
test statistic, a notion studied in depth in [18]. Specifically,
let $L = \log(P_1(Y)/P_0(Y))$ be the log-likelihood statistic. We
require that

$$\max_{r \ge 0} \mathbb{E}[L - r \mid L \ge r] < \infty \tag{15}$$

and

$$\min_{r \ge 0} \mathbb{E}[L + r \mid L \le -r] > -\infty. \tag{16}$$

This condition is satisfied when $L$ follows any bounded
distribution, Gaussian distributions, exponential distributions,
among others. It is not satisfied by distributions with infinite
variance or polynomial tails. A more thorough discussion of
this restriction is given in [18].

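Corollary 6. Rare population. Fix $\delta \in (0, 1/2]$. The S-SPRT
with any $\gamma_L \in (0, 1)$ and $\gamma_U = \frac{1-\pi}{\pi\delta}$
satisfies $P_e \le \frac{\delta}{1+\delta}$ and

$$\lim_{\pi \to 0} \pi\,\mathbb{E}[N] \le \frac{C_1}{D(P_0\|P_1)}$$

for some constant $C_1$ independent of $\pi$ and $\delta$.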

The proof of Corollary 6 is an immediate consequence of
Theorem 7. Note that $\gamma_U > 1$, since we assume that $\pi \le 1/2$.
As the atypical populations become rare, sampling is
dominated by finding an atypical population, which is order
$\pi^{-1}$. The constant factor of $C_1/D(P_0\|P_1)$ is the
multiplicative increase in the number of samples required when the
problem becomes noisy. $C_1$ can be explicitly calculated in a
number of scenarios (see Examples 1 and 2).

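Theorem 7. The S-SPRT with $\gamma_L \in (0, 1)$ and
$\gamma_U = \frac{1-\pi}{\pi\delta}$, $\delta \in (0, 1/2]$, satisfies

$$P_e \le \frac{\delta}{1+\delta}$$

and

$$\mathbb{E}[N] \le \frac{C_1}{\pi D(P_0\|P_1)} + \frac{\log\frac{1}{\pi\delta}}{D(P_1\|P_0)} + \frac{C_2}{D(P_1\|P_0)} \tag{17}$$

for some constants $C_1$ and $C_2$ independent of $\pi$ and $\delta$.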

Proof: The proof is based on techniques used for analysis
of the SPRT. From [16], the false positive and false negative
events are related to the thresholds as:

$$\alpha \le \gamma_U^{-1}(1-\beta) \le \gamma_U^{-1} = \frac{\delta\pi}{1-\pi} \tag{18}$$

$$\beta \le \gamma_L(1-\alpha) \le \gamma_L. \tag{19}$$

From (4), the probability the procedure terminates in error,
returning a population following $P_0$, is

$$P_e \le \frac{1}{1 + \frac{\gamma_U\pi}{1-\pi}} = \frac{\delta}{1+\delta}. \tag{20}$$

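To show the second part of the theorem, first define the
log-likelihood ratio as

$$L_i^{(j)} := \sum_{k=1}^{j} \log\frac{P_1(Y_{i,k})}{P_0(Y_{i,k})}. \tag{21}$$

By Wald's identity [16],

$$E_0 = \frac{-\mathbb{E}_0\left[L_1^{(N_1)}\right]}{D(P_0\|P_1)} = \frac{(1-\alpha)\,\mathbb{E}_0\left[-L_1^{(N_1)} \mid \hat{X}_1 = 0\right] + \alpha\,\mathbb{E}_0\left[-L_1^{(N_1)} \mid \hat{X}_1 = 1\right]}{D(P_0\|P_1)}.$$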

The expected value of the log-likelihood ratio after $N_i$ samples
(i.e., when the procedure stops sampling index $i$) is often
approximated by the stopping boundaries themselves (see
[16]). In our case, it is sufficient to show the value of the
likelihood ratio when the procedure terminates or moves to
the next index can be bounded by a constant independent of $\pi$
and $\delta$. From [17], equations 4.9 and 4.10, for $C_1' \ge 0$,

$$\mathbb{E}_0\left[L_1^{(N_1)} \mid \hat{X}_1 = 0\right] \ge \log\gamma_L - C_1' \tag{22}$$

and

$$\mathbb{E}_0\left[L_1^{(N_1)} \mid \hat{X}_1 = 1\right] \le \log\gamma_U + C_1' \tag{23}$$

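where $C_1'$ is any constant that satisfies both

$$C_1' \ge -\min_{r \ge 0} \mathbb{E}_0\left[L^{(1)} + r \mid L^{(1)} \le -r\right]$$

and

$$C_1' \ge \max_{r \ge 0} \mathbb{E}_0\left[L^{(1)} - r \mid L^{(1)} \ge r\right].$$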

$C_1'$ depends only on the distribution of $L^{(1)}$ and is trivially
independent of $\gamma_L$ and $\gamma_U$. Under the assumptions of (15)
and (16), the constants are finite. $C_1'$ can be explicitly calculated
for a variety of problems (see Examples 1 and 2, and [19],
p. 145, and [18]). $C_1'$ is a bound on the overshoot in the
log-likelihood ratio when it falls outside $\gamma_U$ or $\gamma_L$. We have

$$E_0 \le \frac{(1-\alpha)\left(C_1' + \log\gamma_L^{-1}\right) + \alpha\left(-C_1' + \log\gamma_U^{-1}\right)}{D(P_0\|P_1)} \le \frac{(1-\alpha)\left(C_1' + \log\gamma_L^{-1}\right)}{D(P_0\|P_1)}$$

where the second inequality follows as $\gamma_U > 1$. Likewise,

$$E_1 \le \frac{(1-\beta)\left(C_2' + \log\gamma_U\right) + \beta\left(-C_2' + \log\gamma_L\right)}{D(P_1\|P_0)} \le \frac{(1-\beta)\left(C_2' + \log\gamma_U\right)}{D(P_1\|P_0)}$$

for some constant $C_2' \ge 0$ which represents the overshoot of
the log-likelihood ratio given $X_i = 1$. Combining these with
(3) bounds the expected number of samples:

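$$\mathbb{E}[N] \le \frac{\pi(1-\beta)\frac{C_2' + \log\gamma_U}{D(P_1\|P_0)} + (1-\pi)(1-\alpha)\frac{C_1' + \log\gamma_L^{-1}}{D(P_0\|P_1)}}{\alpha(1-\pi) + \pi(1-\beta)}$$

$$\le \frac{C_2' + \log\gamma_U}{D(P_1\|P_0)} + \frac{C_1' + \log\gamma_L^{-1}}{\pi(1-\gamma_L)D(P_0\|P_1)}$$

$$\le \frac{C_1}{\pi D(P_0\|P_1)} + \frac{\log\frac{1}{\pi\delta}}{D(P_1\|P_0)} + \frac{C_2}{D(P_1\|P_0)}$$

where the second inequality follows from dropping $\alpha(1-\pi)$
from the denominator, replacing $\beta$ with the bound in (19), and
dropping $(1-\alpha)(1-\pi)$ from the numerator of the second term.
The third inequality follows from defining $C_2 = C_2'$, and

$$C_1 = \frac{C_1' + \log\gamma_L^{-1}}{1-\gamma_L}. \tag{24}$$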



Example 1. Searching for a Gaussian with positive mean.
Consider searching for a population following $P_1 \sim \mathcal{N}(\mu, 1)$
amongst a number of populations following $P_0 \sim \mathcal{N}(-\mu, 1)$
for some $\mu > 0$. The Kullback-Leibler divergence between the
Gaussian distributions is $D(P_0\|P_1) = 2\mu^2$. From [19], p. 145,
we have an explicit expression for $C_1'$:

C'M = 2^+ r ^_ t2/Ht y (25) 

In order to make our bound on E[N] as tight as possible, we 
would like to minimize C\ from ( |24| with respect to 7l. Since 
the minimizer has no closed form expression, we use the sub- 
optimal value 7l = for fj, > 1, and 7 l = /x for /z < 1. 
For this choice of 7l, the constant Ci = Ci(/i) in Theorem 
and ((24]) is 

giW+jgiW if „ -> i 
' — 

Cl(/x)+log(l/M) :f „ ^ 1 
l-,u 11 / x ^ L - 

Consider the following two limits. First, as $\mu \to \infty$,

$$\lim_{\mu \to \infty} \frac{C_1(\mu)}{D(P_0\|P_1)} = 1.$$

As a consequence (from Corollary 6),

$$\lim_{\mu \to \infty} \lim_{\pi \to 0} \pi\,\mathbb{E}[N] \le \lim_{\mu \to \infty} \frac{C_1(\mu)}{D(P_0\|P_1)} = 1.$$

Corollary 3 implies this bound is tight. As $\mu$ tends to infinity
we approach the noise-free case, and the procedure is able to
make perfect decisions with one sample per population. As
expected, the required number of samples grows as $1/\pi$.
Second, as $\mu \to 0$,

$$\lim_{\mu \to 0} C_1(\mu) = 1$$

which implies (again from Corollary 6)

$$\lim_{\mu \to 0} \lim_{\pi \to 0} \pi D(P_0\|P_1)\,\mathbb{E}[N] \le \lim_{\mu \to 0} C_1(\mu) = 1.$$

Comparison to Corollary 4 shows the bound is tight. For
small $\pi$, the S-SPRT requires $1/(\pi D(P_0\|P_1))$ samples as the
distributions become very similar, and no procedure can do
better.

Fig. 1 plots the expected number of samples scaled by $\pi$ as a
function of $\mu$. Specifically, the figure shows three plots. First,
$\mu$ vs. $\pi\,\mathbb{E}[N]$ obtained from simulation of the S-SPRT
procedure with $\pi = 10^{-3}$, $\gamma_L = 1$ and
$\gamma_U = \frac{1-\pi}{\pi\delta}$ for $\delta = 10^{-2}$ is
plotted. Second, the lower bound from Theorem 5 is shown.
For small $\pi$, from (6), any reliable procedure has

$$\mathbb{E}[N] \ge \frac{1}{\pi}\max\left(1, \frac{1}{D(P_0\|P_1)}\right).$$

Lastly, the upper bound from Theorem 7 is plotted. From (17),
for small values of $\pi$, the S-SPRT achieves

$$\mathbb{E}[N] \le \frac{C_1}{\pi D(P_0\|P_1)}$$

where $C_1$ is calculated by minimizing (24) over values of
$\gamma_L \in (0, 1)$ for each value of $\mu$. The resulting upper bound
is within a small factor of the lower bound for all values of $\mu$.



Fig. 1. Expected number of samples scaled by $\pi$ as a function of the mean
of the atypical population, $\mu$, corresponding to Example 1. Simulation of the
S-SPRT is plotted with the upper bound $C_1/D(P_0\|P_1)$ from Corollary 6 and
the lower bound from Corollary 2. Simulation details: $\pi = 10^{-3}$,
$P_e \le 10^{-2}$, $10^3$ trials for each value of $\mu$.



Example 2. Searching for a biased coin. Consider the problem
of searching for a coin with bias towards heads of $1/2 + b$
amongst coins with bias towards heads of $1/2 - b$, for
$b \in [0, 1/2]$. This problem was studied recently in [9].

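Corollary 8. Biased Coin. The S-SPRT procedure with
$\gamma_L = \frac{1-2b}{1+2b}$ and $\gamma_U = \frac{1-\pi}{\pi\delta}$
satisfies $P_e \le \frac{\delta}{1+\delta}$ and

$$\mathbb{E}[N] \le \frac{1}{2b^2}\left(\frac{1}{\pi} + \log\frac{1}{\pi\delta} + 1\right).$$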

Proof: The proof follows from evaluation of the constants
in Theorem 7. The log-likelihood ratio corresponding to each
sample (each coin flip) takes one of two values: if a coin
reveals heads, $L^{(1)} = \log\frac{1+2b}{1-2b}$, and if a coin reveals
tails, $L^{(1)} = \log\frac{1-2b}{1+2b}$. When each individual SPRT
terminates, it can exceed the threshold by no more than this
value, giving

$$C_1'(b) = \log\frac{1+2b}{1-2b}, \qquad C_2'(b) = \log\frac{1+2b}{1-2b}.$$

With $\gamma_L = \frac{1-2b}{1+2b}$, we can directly calculate the
constants in Theorem 7. From (24),

$$\frac{C_1(b)}{D(P_0\|P_1)} = \frac{1+2b}{4b^2} \le \frac{1}{2b^2}$$

as the Kullback-Leibler divergence is
$D(P_0\|P_1) = D(P_1\|P_0) = 2b\log\frac{1+2b}{1-2b}$. Also note

$$\frac{1}{D(P_1\|P_0)} \le \frac{1}{2b^2}.$$

Lastly,

$$\frac{C_2(b)}{D(P_1\|P_0)} = \frac{C_2'}{D(P_1\|P_0)} = \frac{1}{2b} \le \frac{1}{2b^2}.$$

Combining these with Theorem 7 completes the proof. ■
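
The constants appearing in this proof are easy to check numerically. The sketch below is an illustration under the stated choices $\gamma_L = \frac{1-2b}{1+2b}$ and $C_1' = C_2' = \log\frac{1+2b}{1-2b}$; the function names are hypothetical, and $b$ must lie strictly between 0 and 1/2 for the logarithm to be finite. It evaluates the three ratios and compares the bound of Theorem 7 with the simplified bound of Corollary 8.

```python
import math


def coin_constants(b):
    """Return (C_1/D, C_2/D, 1/D) for coins with heads-probability 1/2 +- b."""
    step = math.log((1 + 2 * b) / (1 - 2 * b))             # magnitude of one log-likelihood step
    d = 2 * b * step                                       # D(P_0||P_1) = D(P_1||P_0)
    gamma_l = (1 - 2 * b) / (1 + 2 * b)
    c1 = (step + math.log(1 / gamma_l)) / (1 - gamma_l)    # equation (24) with C_1' = step
    c2 = step                                              # C_2 = C_2'
    return c1 / d, c2 / d, 1 / d


def theorem7_bound(b, pi, delta):
    """Right-hand side of (17) with the coin constants plugged in."""
    c1_over_d, c2_over_d, inv_d = coin_constants(b)
    return c1_over_d / pi + math.log(1 / (pi * delta)) * inv_d + c2_over_d


def corollary8_bound(b, pi, delta):
    """Simplified bound (1 / (2 b^2)) * (1/pi + log(1/(pi*delta)) + 1)."""
    return (1 / pi + math.log(1 / (pi * delta)) + 1) / (2 * b ** 2)


b, pi, delta = 0.1, 1e-3, 1e-2
print(coin_constants(b))
print(theorem7_bound(b, pi, delta), corollary8_bound(b, pi, delta))
```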

Comparison of Corollary 8 to Theorem 2 of [9] shows
the leading constant is a factor of 32 smaller in the bound
presented here.

Moreover, closer inspection reveals that the constant $C_1(b)$ can be further tightened. Specifically, note that when an individual SPRT estimates $\hat{X}_i = 0$, it must hit the lower threshold exactly (since $\gamma_L = (1-2b)/(1+2b)$). If we choose only values of $\delta$ such that the upper threshold is an integer multiple of the log-likelihood ratio (i.e., set $\log\gamma_U = k\log((1+2b)/(1-2b))$ for some integer $k$), the overshoot here is also zero. Then $C_1' = 0$ and $C_2' = 0$, which give

$$\frac{C_1(b)}{D(P_0\|P_1)} = \frac{1+2b}{8b^2}.$$

From Corollary 6,

$$\lim_{\pi\to 0} \pi\,\mathbb{E}[N] \le \frac{1+2b}{8b^2}. \tag{26}$$



For small $\pi$, the number of samples required by any procedure to reliably identify an atypical population is

$$\mathbb{E}[N] \le \frac{1}{\pi}\left(\frac{1+2b}{8b^2}\right).$$



If $b = 1/2$ (each coin flip is deterministic), $C_1/D(P_0\|P_1) = 1$, and the expected number of samples grows as $1/\pi$ as expected. The upper bound in Corollary 6 and lower bound in Corollary 2 converge.

Likewise, as the bias of the coin becomes small, $\lim_{b\to 0} C_1(b) = 1$, and the expected number of samples to reliably identify an atypical population grows as $1/(\pi D(P_0\|P_1))$. Again the upper and lower bounds converge.

Note that the S-SPRT procedure for testing the coins is equivalent to a simple, intuitive procedure, which can be implemented as follows: beginning with coin $i$ and a scalar statistic $T = 0$, if heads appears, add 1 to the statistic. Likewise, if tails appears, subtract 1 from the statistic. Continue to flip the coin until either 1) $T$ falls below 0, or 2) $T$ exceeds some upper threshold (which controls the error rate). If the statistic falls below 0, move to a new coin and reset the count, i.e., set $T = 0$; conversely, if the statistic exceeds the upper threshold, terminate the procedure. Note that any time the coin shows tails on the first flip, the procedure immediately moves to a new coin.
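The counting procedure just described can be sketched as a short simulation. This is a minimal sketch under stated assumptions, not the simulation code behind the figures: each coin is taken to be atypical independently with probability `pi`, the procedure stops once the statistic reaches an integer level `upper` (standing in for the upper threshold), and the names `search_biased_coin`, `upper` and `rng` are hypothetical.

```python
import random


def search_biased_coin(pi: float, b: float, upper: int, rng: random.Random):
    """Run the +1/-1 counting procedure over a stream of coins.

    Returns (total_flips, found_biased), where found_biased tells whether the
    coin on which the procedure stopped really has heads probability 1/2 + b.
    """
    flips = 0
    while True:
        biased = rng.random() < pi             # label of the next coin
        p_heads = 0.5 + b if biased else 0.5 - b
        t = 0                                  # reset the statistic for each coin
        while 0 <= t < upper:
            flips += 1
            t += 1 if rng.random() < p_heads else -1
        if t >= upper:                         # statistic reached the upper level
            return flips, biased
        # otherwise t fell below 0: discard this coin and move to the next one


rng = random.Random(0)
trials = [search_biased_coin(pi=0.01, b=0.1, upper=20, rng=rng) for _ in range(100)]
mean_flips = sum(f for f, _ in trials) / len(trials)
error_rate = sum(1 for _, ok in trials if not ok) / len(trials)
print(mean_flips, error_rate)
```

Raising `upper` lowers the empirical error rate at the cost of more flips on the final coin, which mirrors the role of the upper threshold in the text.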



Fig. 2. Expected number of samples scaled by $\pi$ as a function of the bias of the coin corresponding to Example 2. Upper and lower bounds from Corollaries 6 and 2. Simulation of the S-SPRT: $\pi = 10^{-3}$, $P_e \le 10^{-2}$, $10^2$ trials for each value of $b$.

Fig. 2 plots the expected number of samples scaled by $\pi$ as a function of the bias of the atypical coins, $b$. The S-SPRT was simulated with the lower threshold set at $\log\gamma_L = \log\frac{1-2b}{1+2b}$ for $\pi = 10^{-3}$ and $P_e \le 10^{-2}$. The upper and lower bounds from Corollaries 6 and 2 are also plotted. The constant in the upper bound is given by the expression in (26).

Notice that the simulated procedure appears to achieve the upper bound. Closer inspection of the derivation of Theorem 7 with $C_1' = 0$ (as the overshoot in (22) is zero) shows the bound on the number of samples required by the S-SPRT is indeed tight for the search for the biased coin.

Remark 1. The S-SPRT procedure is fairly insensitive to our knowledge of the true prior probability $\pi$. On one hand, if we overestimate $\pi$ by using a larger $\tilde{\pi}$ to specify the upper threshold $\gamma_U = \frac{1-\tilde{\pi}}{\tilde{\pi}\delta}$, then according to (20) the probability of error $P_e$ increases and is approximately $\delta\tilde{\pi}/\pi$, while the order of $\mathbb{E}[N]$ remains the same. On the other hand, if our $\tilde{\pi}$ underestimates $\pi$, then the probability of error $P_e$ is reduced by a factor of $\tilde{\pi}/\pi$, and the order of $\mathbb{E}[N]$ also remains the same, provided $\log(1/\tilde{\pi}) \le 1/\pi$, i.e., $\tilde{\pi}$ is not exponentially smaller than $\pi$. As a consequence, it is sensible to underestimate $\pi$, rather than overestimate $\pi$, as the latter would increase the probability of error.

Remark 2. Implementing a sequential probability ratio test on each population can be challenging for many practical problems. While the S-SPRT is optimal when both $P_0$ and $P_1$ are known and testing a single population amounts to a simple binary hypothesis test, scenarios often arise where some parameter of distribution $P_1$ is unknown. Since the SPRT is based on exact knowledge of $P_1$, it cannot be implemented in this case. Many alternatives to the SPRT have been proposed for composite hypotheses (see [20], [21], etc.). In the next section we propose an alternative that is near optimal and also very simple to implement.

V. Sequential Thresholding 

Sequential thresholding, first proposed for sparse recovery problems in [7], can be applied to the search for an atypical population, and admits a number of appealing properties. It is particularly well suited for problems in which the atypical distributions are rare. While sequential thresholding requires more samples than the S-SPRT, it does not require full knowledge of the distributions, specifically $P_1$, as required by the S-SPRT (see Remarks 2 and 4). Moreover, the procedure admits a general error analysis, and perhaps most importantly is very simple to implement (a similar procedure is used in the
SETI project B), (6)). Somewhat surprisingly, the procedure 
can substantially outperform non-adaptive procedures as $\pi$ becomes small. Roughly speaking, for small values of $\pi$, the procedure reliably recovers an atypical population with

$$\mathbb{E}[N] \lesssim \frac{\log\log\pi^{-1}}{\pi D(P_0\|P_1)}.$$



Algorithm 3 Sequential Thresholding

input: integer $k_{\max}$, integer $m$, threshold $\gamma$
initialize: $i = 1$, $k = 1$
while $k < k_{\max} + 1$ do
    measure: $(Y_{i,(k-1)m+1}, \ldots, Y_{i,km})$
    if $T(Y_{i,(k-1)m+1}, \ldots, Y_{i,km}) < \gamma$ then
        $i = i + 1$
        $k = 1$
    else
        $k = k + 1$
    end if
end while
output: $\hat{X}_i = 1$

Sequential thresholding requires three inputs: 1) $k_{\max}$, an integer representing the maximum number of rounds for any particular index, 2) $m$, an integer representing the number of samples per round, and 3) $\gamma$, a threshold. Let $T(Y_{i,1}, \ldots, Y_{i,m})$ represent a sufficient statistic that does not depend on the parameters of $P_1$ or $P_0$ (for example, in the Gaussian case, $T(Y_{i,1}, \ldots, Y_{i,m}) = \sum_{j=1}^{m} Y_{i,j}$).



The procedure searches for an atypical population as follows. Starting on population $i$, the procedure takes $m$ samples. If the sufficient statistic comprised of those $m$ samples is greater than the threshold, i.e. $T(Y_{i,1}, \ldots, Y_{i,m}) > \gamma$, the procedure takes an additional block of $m$ samples of index $i$ and forms $T(Y_{i,m+1}, \ldots, Y_{i,2m})$ (which is only a function of the second block of $m$ samples). If $T(Y_{i,m+1}, \ldots, Y_{i,2m}) > \gamma$, a third block of samples is taken. The procedure continues in this manner, re-testing the statistic up to a maximum of $k_{\max}$ times. If the statistic is below the threshold, i.e. $T < \gamma$, after any round, the procedure immediately moves to the next population, setting $i = i + 1$, and resetting $k$. Should any population survive all $k_{\max}$ rounds, the procedure estimates $\hat{X}_i = 1$, and terminates. The procedure is described in Algorithm 3.
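The steps of the procedure translate almost line for line into code. The following is a minimal sketch, not a reference implementation: `sample_block` and `statistic` are hypothetical callables supplied by the user, and `gaussian_sampler` is an illustrative data source in which each population is $N(\mu, 1)$ with probability `pi` and $N(0, 1)$ otherwise, with the block sum as test statistic.

```python
import random
from typing import Callable, List, Tuple


def sequential_thresholding(
    sample_block: Callable[[int, int], List[float]],
    statistic: Callable[[List[float]], float],
    k_max: int,
    m: int,
    gamma: float,
) -> Tuple[int, int]:
    """Follow the loop of Algorithm 3; return (declared index, total samples)."""
    i, k, total = 1, 1, 0
    while k < k_max + 1:
        block = sample_block(i, m)      # m fresh samples of population i
        total += m
        if statistic(block) < gamma:    # failed this round: discard population i
            i += 1
            k = 1
        else:                           # survived this round
            k += 1
    return i, total


def gaussian_sampler(pi: float, mu: float, rng: random.Random):
    """Hypothetical data source: N(mu, 1) with probability pi, else N(0, 1)."""
    labels = {}

    def sample_block(i: int, m: int) -> List[float]:
        if i not in labels:
            labels[i] = rng.random() < pi   # draw the label on first visit
        mean = mu if labels[i] else 0.0
        return [rng.gauss(mean, 1.0) for _ in range(m)]

    return sample_block, labels


rng = random.Random(1)
sample_block, labels = gaussian_sampler(pi=0.05, mu=1.0, rng=rng)
index, n_samples = sequential_thresholding(sample_block, sum, k_max=6, m=8, gamma=4.0)
print(index, n_samples, labels[index])
```

Note that the statistic is recomputed from each new block only; nothing is carried over between rounds, which is the feature that distinguishes this procedure from an SPRT on the same population.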

Control of the probability of error depends on the threshold $\gamma$, the number of rounds $k_{\max}$, and the number of samples per round, $m$. Define the probability that the test statistic exceeds the threshold given the current index follows $P_0$ as

$$p := P_0(T > \gamma)$$

where $p \in (0, 1)$. Note that $p$ is fixed and not a function of $\pi$.

Intuitively, the procedure can control the probability of error as follows. First, $\alpha$ can be made small by simply increasing $k_{\max}$, as, by the independence of the blocks of samples, $\alpha = p^{k_{\max}}$. Of course, as $k_{\max}$ is increased, $\beta$ also increases. In order to control $\beta$, $m$ is increased. As we show in the following theorem, to control $\beta$, it is sufficient to have $m$ grow as $\log\log\pi^{-1}$. This $\log\log\pi^{-1}$ can be interpreted as the penalty the sub-optimal procedure pays for increased robustness. The following theorem quantifies the number of samples required to recover an index following $P_1$.

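Theorem 9. Sequential Thresholding. For any $p \in (0,1)$, $\delta \in (0,1)$, and $\varepsilon > 0$, set

$$k_{\max} = \log_{1/p}\left(\frac{1-\pi}{\pi\delta}\right) \quad\text{and}\quad m = \frac{(1+\varepsilon)\log k_{\max}}{D(P_0\|P_1)}.$$

Sequential thresholding then satisfies

$$\lim_{\pi\to 0} P_e \le \frac{\delta}{1+\delta}$$

and

$$\lim_{\pi\to 0} \frac{\mathbb{E}[N]}{\pi^{-1}\log\log_{1/p}\pi^{-1}} \le \frac{1+\varepsilon}{D(P_0\|P_1)(1-p)}.$$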
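To get a feel for the sizes involved, the sketch below evaluates the parameter choices $k_{\max} = \log_{1/p}\frac{1-\pi}{\pi\delta}$ and $m = \frac{(1+\varepsilon)\log k_{\max}}{D(P_0\|P_1)}$, which are what the condition $\alpha = p^{k_{\max}} \le \pi\delta/(1-\pi)$ used in the proof calls for. Rounding both quantities up to integers is an assumption made here so that they can be used as round and sample counts; the function name `thresholding_parameters` and the argument `kl` (standing for $D(P_0\|P_1)$) are hypothetical.

```python
import math


def thresholding_parameters(pi: float, delta: float, p: float, eps: float, kl: float):
    """Return (k_max, m, alpha) for sequential thresholding.

    k_max and m are rounded up to integers (an assumption of this sketch);
    alpha = p ** k_max is the resulting per-population false positive rate.
    """
    k_max = math.ceil(math.log((1.0 - pi) / (pi * delta)) / math.log(1.0 / p))
    m = math.ceil((1.0 + eps) * math.log(k_max) / kl)
    alpha = p ** k_max
    return k_max, m, alpha


# Example: Gaussian populations N(0, 1) versus N(1, 1), for which the
# Kullback-Leibler divergence is mu^2 / 2 = 0.5.
for pi in (1e-2, 1e-4, 1e-6):
    k_max, m, alpha = thresholding_parameters(pi, delta=0.01, p=0.5, eps=0.1, kl=0.5)
    assert alpha <= pi * 0.01 / (1.0 - pi)   # the false positive target is met
    print(pi, k_max, m, alpha)
```

The number of rounds grows like $\log\pi^{-1}$ while the block length $m$ grows only like $\log\log\pi^{-1}$, so $m$ changes very little even as $\pi$ shrinks by orders of magnitude.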



Proof: Employing sequential thresholding, the false positive event depends on the number of rounds as $\alpha = p^{k_{\max}}$. With $k_{\max}$ as specified, we have $\alpha \le \pi\delta/(1-\pi)$. The probability the procedure returns a population corresponding to $P_0$ then follows from Q as

$$P_e \le \frac{\delta}{\delta + 1 - \beta}. \tag{27}$$
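The bound (27) can be checked in one line. The sketch below assumes the probability of error has the form $P_e = \frac{\alpha(1-\pi)}{\alpha(1-\pi)+\pi(1-\beta)}$, i.e. the probability that a population declared atypical actually follows $P_0$; this form is an assumption made here for illustration, chosen to match the denominator of the expression for $\mathbb{E}[N]$ used later in the proof.

```latex
% Uses alpha(1 - pi) <= pi * delta, and the fact that x / (x + c) is
% increasing in x for any fixed c > 0.
\[
P_e = \frac{\alpha(1-\pi)}{\alpha(1-\pi) + \pi(1-\beta)}
    \le \frac{\pi\delta}{\pi\delta + \pi(1-\beta)}
    = \frac{\delta}{\delta + 1 - \beta}.
\]
```

In particular, once $\beta$ is shown to vanish, the right-hand side tends to $\delta/(1+\delta)$.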

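Next, we show $\beta$ tends to zero as $\pi$ becomes small. The Chernoff-Stein Lemma [22] states that since $p$ is fixed,

$$\lim_{m\to\infty} \frac{\log P_1(T < \gamma)}{m} = -D(P_0\|P_1),$$

which implies

$$\lim_{\pi\to 0} \frac{\log P_1(T < \gamma)}{\log k_{\max}} = -(1+\varepsilon),$$

as $m = \frac{(1+\varepsilon)\log k_{\max}}{D(P_0\|P_1)}$ and $\lim_{\pi\to 0} m = \infty$. By definition,

$$\beta = P_1\left(\bigcup_{k=1}^{k_{\max}} \{T < \gamma\}\right) \le k_{\max} P_1(T < \gamma),$$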



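where the inequality holds by the union bound. Therefore, we obtain

$$\lim_{\pi\to 0} \beta \le \lim_{\pi\to 0} \exp\left(\log(k_{\max})\left(1 + \frac{\log P_1(T < \gamma)}{\log k_{\max}}\right)\right) = \exp\left(-\varepsilon \lim_{\pi\to 0} \log k_{\max}\right) = 0.$$

Combined with (27), we have

$$\lim_{\pi\to 0} P_e \le \frac{\delta}{1+\delta}.$$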

The expected number of samples required for any index following $P_0$ is given as

$$E_0 = \sum_{k=1}^{k_{\max}} m\,p^{k-1} \le \frac{m}{1-p}.$$



On the other hand, the expected number of samples given the index follows $P_1$ is at most $m$ times the maximum number of rounds:

$$E_1 \le m k_{\max}.$$

From (3) we have

$$\mathbb{E}[N] \le \frac{\pi m k_{\max} + (1-\pi)\frac{m}{1-p}}{\alpha(1-\pi) + \pi(1-\beta)} \le \frac{m k_{\max}}{1-\beta} + \frac{m(1-\pi)}{\pi(1-\beta)(1-p)}.$$
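With $k_{\max}$ and $m$ as specified,

$$\lim_{\pi\to 0} \frac{\pi\, m k_{\max}}{\log\log_{1/p}\pi^{-1}\,(1-\beta)} = 0$$

and

$$\lim_{\pi\to 0} \frac{m(1-\pi)}{\log\log_{1/p}\pi^{-1}\,(1-p)(1-\beta)} = \frac{1+\varepsilon}{D(P_0\|P_1)(1-p)},$$

implying

$$\lim_{\pi\to 0} \frac{\mathbb{E}[N]}{\pi^{-1}\log\log_{1/p}\pi^{-1}} \le \frac{1+\varepsilon}{D(P_0\|P_1)(1-p)}.$$

■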



Remark 3. Similar to the behavior of the SPRT discussed in Remark 1, sequential thresholding is also fairly insensitive to our prior knowledge of $\pi$, especially when we underestimate $\pi$. More specifically, overestimating $\pi$ increases the probability of error almost proportionally and has nearly no effect on $\mathbb{E}[N]$, while underestimating $\pi$ decreases the probability of error and the order of $\mathbb{E}[N]$ is the same as long as $\log(1/\tilde{\pi}) \le 1/\pi$, where $\tilde{\pi}$ denotes the value used in place of $\pi$.

Remark 4. For many distributions in the exponential family, the log-likelihood ratio, $L_i$, defined in (21), is a monotonic function of a test statistic $T$ that does not depend on parameters of $P_1$. As a consequence of the sufficiency of $T$, the threshold $\gamma$ depends only on $P_0$, making sequential thresholding suitable when knowledge about $P_1$ is not available.

Perhaps most notably, in contrast to the SPRT-based procedure, sequential thresholding does not aggregate statistics. Roughly speaking, this results in increased robustness to modeling errors in $P_1$ at the cost of a sub-optimal procedure. Analysis of sequential thresholding in related sparse recovery problems can be found in [7], [11].

VI. Limitations of Non-Adaptive Procedures 

For our purposes a non-adaptive procedure tests each individual population with a pre-determined number of samples, denoted $N_0$. In this case, the conditional number of samples for each individual test is simply $E_0 = E_1 = N_0$, giving

$$\mathbb{E}[N] = \frac{N_0}{\alpha(1-\pi) + \pi(1-\beta)}. \tag{28}$$



To compare the sampling requirements of non-adaptive procedures to adaptive procedures, we present a necessary condition for reliable recovery. The theorem implies that non-adaptive procedures require a factor of $\log\pi^{-1}$ more samples than the best adaptive procedures.

Theorem 10. Non-adaptive procedures. Any non-adaptive procedure that satisfies

$$P_e \le \frac{\delta}{1+\delta}$$

also has

$$\mathbb{E}[N] \ge \frac{\cdots}{\pi(1+\delta)D(P_1\|P_0)}$$

for $\delta < 1/2$.



Proof: Assume that $P_e \le \frac{\delta}{1+\delta}$, and from Q we have

$$\frac{\alpha(1-\pi)}{\pi(1-\beta)} \le \delta. \tag{29}$$

From (28),

$$\mathbb{E}[N] \ge \frac{N_0}{\pi(1+\delta)(1-\beta)}.$$
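The step from (28) to this lower bound can be spelled out using only (28) and (29); what follows is a reading of the argument rather than text from the original. Rearranging (29) gives $\alpha(1-\pi) \le \delta\pi(1-\beta)$, so the denominator of (28) is bounded as shown below, and dividing $N_0$ by this larger quantity can only decrease the ratio.

```latex
% Upper bound on the denominator of (28), obtained from (29).
\[
\alpha(1-\pi) + \pi(1-\beta)
  \le \delta\pi(1-\beta) + \pi(1-\beta)
  = \pi(1+\delta)(1-\beta).
\]
```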



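Next, for any binary hypothesis test with false positive probability $\alpha$ and false negative probability $\beta$, the following inequality holds:

$$N_0 \ge \frac{\beta\log\left(\frac{\beta}{1-\alpha}\right) + (1-\beta)\log\left(\frac{1-\beta}{\alpha}\right)}{D(P_1\|P_0)}. \tag{30}$$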



To see (30), recall that for non-adaptive procedures, $N_0 = E_0 = E_1$, and thus both bounds in (9) and (10) apply. This gives



E[N] 



> 



> 



> 



log 



7t(1 + S)(1-(3)D(Pi\\P ) tt(1 + S)D(P 1 \\P ) 

log (^) - 1 
ir(l + 5)D(Pi\\P ) 

ir(l + 6)D(Pi\\P ) 



where the second inequality follows from (11) and (29), and the last inequality as $\pi < 1/2$.



Remark 5. The lower bound presented in Theorem 10 implies that non-adaptive procedures require at best a multiplicative factor of $\log\pi^{-1}$ more samples than adaptive procedures (as adaptive procedures are able to come within a small constant of the lower bound in Theorem 5). For problems with even modestly small values of $\pi$, this results in non-adaptive sampling requirements many times larger than those required by adaptive sampling procedures.

VII. Conclusion 

This paper explored the problem of finding an atypical population amongst a number of typical populations, a problem arising in many aspects of science and engineering.

More specifically, this paper quantified the number of samples required to recover an atypical population with high probability. We paid particular attention to problems in which the atypical populations themselves become increasingly rare. After establishing a lower bound based on the Kullback-Leibler divergence between the underlying distributions, the number of samples required by the optimal S-SPRT procedure was studied; the number of samples is within a constant factor of the lower bound, which can be explicitly derived in a number of cases. Two common examples, where the distributions are Gaussian and Bernoulli, were studied.

Sequential thresholding, a more robust procedure that can be implemented with less prior knowledge about the distributions, was presented and analyzed in the context of the quickest search problem. Sequential thresholding requires a multiplicative factor more samples, doubly logarithmic in the prior, than the S-SPRT procedure. Both sequential thresholding and the SPRT procedure were shown to be fairly robust to modeling errors in the prior probability. Lastly, for comparison, a lower bound for non-adaptive procedures was presented.